note: I used the Lux theme for this document, which I got here

Assignment 9

Below are the instructions for this assignment

For credit…

  1. Push a completed version of your Rproj and R-markdown file (details at end of this assignment) to GitHub
  2. Your score will also depend on your analyses and presentation of your final report

Your tasks:

So, let’s get started




Set up the data

First, load the data and packages

library(tidyverse)
library(ggplot2)
library(aod)
library(plotly)

dat <- read.csv("./Data/GradSchool_Admissions.csv")

Next, get an overview of the data

head(dat)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

Okay, we see we are working with school admittions, where admit is accepted (1) or not accepted (0), their gre test score, gpa, and school rank where 1 is the highest rank

Let’s change rank to a factor

dat$rank <- factor(dat$rank)

Model the data

Create a model

mod <- glm(admit ~ gre + gpa + rank, data = dat, family = "binomial")
mod
## 
## Call:  glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = dat)
## 
## Coefficients:
## (Intercept)          gre          gpa        rank2        rank3        rank4  
##   -3.989979     0.002264     0.804038    -0.675443    -1.340204    -1.551464  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  394 Residual
## Null Deviance:       500 
## Residual Deviance: 458.5     AIC: 470.5

View summary of the model

summary(mod)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

What about an R-squared value?

summary(lm(mod))$r.squared
## [1] 0.1004006

Okay, pretty low… Our model might not be the best but don’t sweat the small stuff

Viewing confidence intervals

Use the confint function to obtain confidence intervals (CIs) for the coefficient estimates

confint(mod)
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605

We can exponentiate the coefficients and interpret them as odds-ratios. We can use the same logic to get probability ratios and their confidence intervals, by exponentiating the confidence intervals from before

exp(coef(mod))
## (Intercept)         gre         gpa       rank2       rank3       rank4 
##   0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375

To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise

## odds ratios and 95% CI
exp(cbind(OR = coef(mod), confint(mod)))
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre         1.0022670 1.000137602 1.0044457
## gpa         2.2345448 1.173858216 4.3238349
## rank2       0.5089310 0.272289674 0.9448343
## rank3       0.2617923 0.131641717 0.5115181
## rank4       0.2119375 0.090715546 0.4706961

We interpret this to mean that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23

Predictions

Let’s use predicted probabilities to help us understand the model. In order to create predicted probabilities we first need to create a new data frame with the values we want the independent variables to take on to create our predictions

First, calculate the predicted probability of admission at each value of rank, holding gre and gpa at their means, and view the new data frame

newdata1 <- with(dat, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))

## view data frame
newdata1
##     gre    gpa rank
## 1 587.7 3.3899    1
## 2 587.7 3.3899    2
## 3 587.7 3.3899    3
## 4 587.7 3.3899    4

Great, now we have the data frame we want to use to calculate the predicted probabilities

Let’s add the probablility values to the data set

newdata1$rankP <- predict(mod, newdata = newdata1, type = "response")
newdata1
##     gre    gpa rank     rankP
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684

By reading the above chart, we understand that (while holding gre and gpa at their means) the predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1)

We can do something very similar to create a table of predicted probabilities varying the value of gre and rank. To plot these, we will create 100 values of gre between 200 and 800, at each value of rank (i.e., 1, 2, 3, and 4)

newdata2 <- with(dat, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
    4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))

Then, we generate the predicted probabilities except we are also going to ask for standard errors (se = TRUE) so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.This gives us our final data set, with all the values we want

newdata3 <- cbind(newdata2, predict(mod, newdata = newdata2, type = "link",
    se = TRUE))
newdata3 <- within(newdata3, {
    PredictedProb <- plogis(fit)
    LL <- plogis(fit - (1.96 * se.fit))
    UL <- plogis(fit + (1.96 * se.fit))
})


head(newdata3)
##        gre    gpa rank        fit    se.fit residual.scale        UL        LL
## 1 200.0000 3.3899    1 -0.8114870 0.5147714              1 0.5492064 0.1393812
## 2 206.0606 3.3899    1 -0.7977632 0.5090986              1 0.5498513 0.1423880
## 3 212.1212 3.3899    1 -0.7840394 0.5034491              1 0.5505074 0.1454429
## 4 218.1818 3.3899    1 -0.7703156 0.4978239              1 0.5511750 0.1485460
## 5 224.2424 3.3899    1 -0.7565919 0.4922237              1 0.5518545 0.1516973
## 6 230.3030 3.3899    1 -0.7428681 0.4866494              1 0.5525464 0.1548966
##   PredictedProb
## 1     0.3075737
## 2     0.3105042
## 3     0.3134499
## 4     0.3164108
## 5     0.3193867
## 6     0.3223773

Visualizing the new data

Make a new plot with the new data set, showing the predicted probability that students were accepted given their gre and school ranking

p <- ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
    ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
    size = 1)
p

That looks’s a little boring, so let’s use an interactive plot using the plotly package

fig <- plot_ly( type = 'scatter', mode = 'markers')
fig <- fig %>% 
  add_trace(data = newdata3,
    x = ~gre,
    y = ~PredictedProb,
    hoverinfo = 'image',
    color = newdata3$rank,
    labels = newdata3$rank,
    showlegend = TRUE
    )
fig

Conclusions

Whether we use the first or the second graph, we can see the relationships between the given varialbes. There appears to be a positive relationship between gre, rank and probability of being accepted. In other words, student’s who had a higher gre and rank had a higher change of acceptance than those who scored lower and came from schools of less prestige.